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Abstract 

In this paper we consider estimation of sparse covariance matrices and propose a 
thresholding procedure which is adaptive to the variability of individual entries. The 
estimators are fully data driven and enjoy excellent performance both theoretically 
and numerically. It is shown that the estimators adaptively achieve the optimal rate 
of convergence over a large class of sparse covariance matrices under the spectral 
norm. In contrast, the commonly used universal thresholding estimators are shown 
to be sub-optimal over the same parameter spaces. Support recovery is also discussed. 
The adaptive thresholding estimators are easy to implement. Numerical performance 
of the estimators is studied using both simulated and real data. Simulation results 
show that the adaptive thresholding estimators uniformly outperform the universal 
thresholding estimators. The method is also illustrated in an analysis on a dataset 
from a small round blue-cell tumors microarray experiment. A supplement to this 
paper which contains additional technical proofs is available online. 
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1 Introduction 

Let X = {Xi, . . . , Xp) be a p-variate random vector with covariance matrix Sq. Given 
an independent and identically distributed (i.i.d.) random sample {Xi, . . . ,X„} from the 
distribution of X, we wish to estimate the covariance matrix Sq under the spectral norm. 
This covariance matrix estimation problem is of fundamental importance in multivariate 
analysis with a wide range of applications. The high dimensional setting, where the di- 
mension p can be much larger than the sample size n, is of particular current interest. 
In such a setting, conventional methods and results based on fixed p and large n are no 
longer applicable and new methods and theories are thus needed. In particular, the sample 
covariance matrix 



where X = n~^'^2=i^k, performs poorly in this setting and structural assumptions are 
required in order to estimate the covariance matrix consistently. 

In this paper we focus on estimating sparse covariance matrices. This problem has 
been considered in the literature. El Karoui (2008) and Bickel and Levina (2008) proposed 
thresholding of the sample covariance matrix S„ and obtained rates of convergence for the 
thresholding estimators. Rothman, Levina and Zhu (2009) considered thresholding of the 
sample covariance matrix with more general thresholding functions. Cai and Zhou (2009 
and 2010) established the minimax rates of convergence under the matrix ii norm and the 
spectral norm. Wang and Zou (2010) considered estimation of volatility matrices based on 
high-frequency financial data. 

A common feature of the thresholding methods for sparse covariance matrix estimation 
proposed in the literature so far is that they all belong to the class of "universal thresholding 
rules". That is, a single threshold level is used to threshold all the entries of the sample 
covariance matrix. Universal thresholding rules were originally introduced by Donoho and 




(1) 



k=l 
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Johnstone (1994 and 1998) for estimating sparse normal mean vectors in the context of 
wavelet function estimation. See also Antoniadis and Fan (2001). An important feature 
of problems considered there is that noise is homoscedastic. In such a setting, universal 
thresholding has demonstrated considerable success in nonparametric function estimation 
in terms of asymptotic optimality and computational simplicity. 

In contrast to the standard homoscedastic nonparametric regression problems, sparse 
covariance matrix estimation is intrinsically a heteroscedastic problem in the sense that the 
entries of the sample covariance matrix could have a wide range of variability. Although 
some universal thresholding rules have been shown to enjoy desirable asymptotic proper- 
ties, this is mainly due to the fact that the parameter space considered in the literature 
is relatively restrictive which forces the covariance matrix estimation problem to be an 
essentially homoscedastic problem. 

To illustrate the point, it is helpful to consider an idealized model where one observes 

yi = + liZi, Zi ~ iV(0, 1) 1 <i <p (2) 

and wishes to estimate the mean vector /i which is assumed to be sparse. If the noise levels 
7j are bounded, say by B, then the universal thresholding rule fii = y.j/(|?/.j| > By/2 logp) 
performs well asymptotically over a standard £q ball 6g(so) defined by 

p 

e,iso) = {fieR^:J2\f^j\'<''^o}. (3) 

i=i 

In particular, 6o(so) is a set of sparse vectors with at most sq nonzero elements. Here the 
assumption that 7i are bounded by B is crucial. The universal thresholding rule simply 
treats the heteroscedastic problem (j2]) as a homoscedastic one with all noise levels 7^ = B. 
It is intuitively clear that this method does not perform well when the range of 7^ is large 
and it fails completely without the uniform boundedness assumption on the 7j's. 

For sparse covariance matrix estimation, the following uniformity class of sparse matri- 
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ces was considered in Bickel and Levina (2008) and Rothman, Levina and Zhu (2009): 

p 

Uq := Ug{so{p)) = |s : S 0, maxau < K, max^\aij\'' < So(p)| 

i=i 

for some < g < 1, where >~ means that "S is positive definite. Here each column of a 
covariance matrix in h{q{so{p)) is assumed to be in the £g ball 6g(so(p)). Define 

%:=Var((X,-/i,)(X,-/i,)), (4) 

where /ij = EXj. It is easy to see that in the Gaussian case, CTucrjj < % < 2aiiajj. 
The condition maxj an < K for all i ensures the variances of the entries of the sample 
covariance matrix to be uniformly bounded. Bickel and Levina (2008) proposed a universal 
thresholding estimator = (a^), where 

^ij = ^ijH^ij > ^n}, (5) 

and showed that with a proper choice of the threshold A„ the estimator achieves a 
desirable rate of convergence under the spectral norm. Rothman, Levina and Zhu (2009) 
considered a class of universal thresholding rules with more general thresholding functions 
than hard thresholding. Similar to the idealized model (|2]) discussed earlier, here a key 
assumption is that the variances an are uniformly bounded by K which is crucial to make 
the universal thresholding rules well behaved. A universal thresholding rule in this case 
essentially treats the problem as if all an = K when selects the threshold A. 

For heteroscedastic problems such as sparse covariance matrix estimation, it is arguable 
more desirable to use thresholds that capture the variability of individual variables instead 
of using a universal upper bound. This is particularly true when the variances vary over 
a wide range or no obvious upper bound on the variances is known. A more natural 
and effective approach is to use thresholding rules with entry-dependent thresholds which 
automatically adapt to the variability of the individual entries of the sample covariance 
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matrix. The main goal of the present paper is to develop such an adaptive thresholding 
estimator and study its properties. 

In this paper we introduce an adaptive thresholding estimator S = (o"* )pxp with 

^tj = (6) 

where sx{z) is a general thresholding function similar to those used in Rothman, Levina 
and Zhu (2009) and will be specified later. The individual thresholds Ay are fully data- 
driven and adapt to the variability of individual entries of the sample covariance matrix 
S„. It is shown that the adaptive thresholding estimator S enjoys excellent properties 
both asymptotically and numerically. In particular, we consider the performance of the 
estimator S over a large class of sparse covariance matrices defined by 

p 

U; := U;{s^{p)) = |S : S ^ 0, max J](a,,a,,)(i-'')/V^.f < ^ob)} (7) 

j=i 

for < g < 1. In comparison to V(q{sQ{p)), the columns of a covariance matrix in U* 
are required to be in a weighted iq ball, instead of a standard ig ball, with the weight 
determined by the variance of the entries of the sample covariance matrix. A particular 
feature of U* is that it no longer requires the variances an to be uniformly bounded and 
allows maxj (Tjj — )■ oo. Note that Ug{so{p)) C U*{K^~'^sq{p)), so the parameter space U* 
contains the uniformity class Uq as a subset. The parameter space U* can also be viewed 
as a weighted iq ball of correlation coefficients. See Section 13.11 for more discussions. 
It will be shown in Section |3] that S achieves the optimal rate of convergence 

logp ^ 



soip) 



n 



over the parameter space U*{sq{p)). In comparison, it is also shown that the best universal 
thresholding estimator can only attain the rate Sq~'^(p) (^^)''^ ^''''^ over W*(so(p)), which 
is clearly sub-optimal when Sq{j>) — > oo since g < 1. 
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The choice of the regularization parameters is important in any regularized estimation 
problem. The thresholds Xij used in ([6]) are based on an estimator of the variance of the 
entries aij of the sample covariance matrix. More specifically, Xij are of the form 



can be taken as fixed at 5 = 2, or it can be empirically chosen through cross validation. 
Theoretical properties of the resulting covariance matrix estimators using both methods 
are investigated. It is shown that the estimators attain the optimal rate of convergence 
under the spectral norm in both cases. In addition, support recovery of a sparse covariance 
matrix is also considered. 

The adaptive thresholding estimators are easy to implement. Numerical performance 
of the estimators is investigated using both simulated and real data. Simulation results 
show that the adaptive thresholding estimators perform favorably in comparison to existing 
methods. In particular, they uniformly outperform the universal thresholding estimators 
in the simulation studies. The procedure is also applied to analyze a dataset from a small 
round blue-cell tumors microarray experiment (Khan et al., 2001). 

The paper is organized as follows. Section [2] introduces the adaptive thresholding proce- 
dure for sparse covariance matrix estimation. Asymptotic properties are studied in Section 
131 It is shown that the adaptive thresholding estimator is rate-optimal over U*, while 
the best universal thresholding estimator is proved to be suboptimal. Section |4] discusses 
data-driven selection of the thresholds using cross validation and establish asymptotic op- 
timality of the resulting estimator. Numerical performance of the adaptive thresholding 
estimators is investigated by simulations and by an application to a dataset from a small 
round blue-cell tumors microarray experiment in Section |5l Section [6] discusses methods 
based on the sample correlation matrix. The proofs are given in Section [71 




(8) 



where 6ij are estimates of 6ij defined in (|4]) and 5 is a tuning parameter. The value of S 
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2 Adaptive thresholding for sparse covariance matrix 

In this section we introduce the adaptive thresholding method for estimating sparse co- 
variance matrices. To motivate our estimator, consider again the sparse normal mean 
estimation problem ([2]). If the noise levels 7j's are known or can be well estimated, a good 
estimator of the mean vector is the hard thresholding estimator /tj = yil{\yi\ > \ogp} 
or some generalized thresholding estimator with the same thresholds 7jA/2 log p. 

Similarly, for sparse covariance matrix estimation, a more effective thresholding rule 
than universal thresholding is the one which adapts to the variability of the individual 
entries of the sample covariance matrix. Define 6ij as in (jlj). Then, roughly speaking, 
estimation of a sparse covariance matrix is similar to the mean vector estimation problem 
based on the observations 



with Zij being asymptotically standard normal. This analogy provides a good motivation 
for our adaptive thresholding procedure. If the 6ij were known, a natural thresholding 
estimator would be {d-°j)pxp with 



where sx{z) is a thresholding function. Comparing to the universal thresholding rule in 
Bickel and Levina (2008), the variance factors % in the thresholds make the thresholding 
rule entry-dependent and leads to a more flexible estimator. In practice, 6ij are typically 
unknown, but can be well estimated. We propose the following estimator of Oij: 




(9) 




(10) 



n 



-^[(X,,-XO(X,,-X^) 



X' = n 



-1 




k=l 



k=l 



This leads to our adaptive thresholding estimator of the covariance matrix Sq, 



^\^) = i^tj)pxp with a*j = sx,^{a,j), 



(11) 
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where 

\r-=\A^) = 5^^^^. (12) 

Here 5 > is a regularization parameter. It can be fixed at 5 = 2 or can be chosen through 
cross vahdation. Good choices of 5 will not affect the rate of convergence, but will affect 
the numerical performance of the resulting estimators. Selection of 5 is thus of practical 
importance and we will discuss it further later. 

The analogy between the sparse covariance estimation problem and the idealized mean 
estimation problem ([9]) gives a good motivation for the adaptive thresholding estimator 
defined in ( ITTi) and ( fT2l) . but of course the matrix estimation problem is not exactly equiv- 
alent to the mean estimation problem ([9]) as noise is not exactly normal or iid and the 
loss is the spectral norm, not a vector norm or the Frobenius norm. We shall make our 
technical analysis precise in Sections |3] and [71 

In the present paper, we consider simultaneously a class of thresholding functions s\{z) 
that satisfy the following conditions: 

(i) . |sa(2;)| < c\y\ for all z,y satisfy \z — y\ < A and some c > 0; 

(ii) . s\{z) = for \z\ < A; 

(iii) . \sx{z) — z\ < X, for all z & M. 

These three conditions are satisfied, for example, by the soft thresholding rule sx{z) = 
sgn{z){z — A)+ and the adaptive lasso rule sx{z) = z{l — \X/z\^)+ with 77 > 1, as called 
in Rothman, Levina and Zhu (2009). We shall present a unified analysis of the adaptive 
thresholding estimators with the thresholding function Sx{z) satisfying the above three 
conditions. It should be noted that Condition (i) excludes the hard thresholding rule. 
However, all of the theoretical results in this paper hold for the hard thresholding estimator 
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under similar conditions. Here Condition (i) is in place only to make the technical analysis 
in Section [7] work in a unified way for the class of thresholding rules. The results for the 
hard thresholding rule require slightly different proofs. 

Rothman, Levina and Zhu (2009) proposed generalized universal thresholding estima- 
tors 

^9 = i^!j)pxp, where afj = sx^i&ij) 

and sx{z) satisfies (ii), (iii) and |sa(2;)| < \z\, which is slightly weaker than (i). Similar 
general universal thresholding rules were introduced and studied by Antoniadis and Fan 
(2001) in the context of wavelet function estimation. We should note that the generalized 
universal thresholding estimators suffer the same shortcomings as those of S^, and like 
they are sub-optimal over the class U*. 

3 Theoretical properties of adaptive thresholding 

We now consider the asymptotic properties of the adaptive thresholding estimator S (6) 
defined in f|TT]) and f fT2|) . It is shown that the estimator S (6) adaptively attains the optimal 
rate of convergence over the collection of parameter spaces U*{so{p)). 
We begin with some notation. Define the standardized variables 

= (X,-^,)/(Var(X,))i/2^ 

where /ij = EXj, and let Y = {Yi, . . . , Yp). Throughout the paper, denote |a|2 = \jY7j=i ^"j 
for the usual Euclidean norm of a vector a = (oi, . . . , ap)^ G For a matrix A = (aij) G 
IW^'', define the spectral norm ||A||2 = sup|x|2<i |Ax|2, the matrix ii norm = 
maxi<j<g ^^^j^ |ajj|, and the Frobenius norm \\A\\f = \J^~~o^j- For two sequences of 
real numbers and write a„ = 0{bn) if there exists a constant C such that 

l^-nl < C\bn\ holds for all sufficiently large n, and write a„ = o(6„) if lim„_^oo o„/6„ = 0. 
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3.1 Rate of convergence 

It is conventional in the covariance matrix estimation literature to divide the technical 
analysis into two cases according the the moment conditions on X. 

(CI). (Exponential-type tails) Suppose that logp = o{n^^^) and there exists some 
?7 > such that 

Eexp (^tY^^^ <Ki<oo for all \t\ < t] and i. (13) 
Furthermore, we assume that for some Tq > 0, 

mmyar{YiYj) > To. (14) 

«i 

(C2). (Polynomial-type tails) Suppose that for some 7, Ci > 0, p < Cin'^, and for 
some e > 

Furthermore, we assume that ( 1T4|) holds. 

Remark 1 Note that (CI) holds with tq = 1 in the Gaussian case where X ~ N{fj,, Sq). 
To this end, let pij be the correlation coefficient of Yi and Yj. We can then write Yi = pijYj + 
— Pij^^ where Y ~ A^(0, 1) is independent of Yj. So we have \/ar(YjYj) = 1 + p'^j > 1. 
Hence ( !T^ holds with tq = 1. 

The follow theorem gives the rate of convergence over the parameter space U* under 
the spectral norm for the thresholding estimator S (6). 

Theorem 1 Let 5 > 2 and < q < 1. 

(i). Under (CI), we have, for some constant CKi.5,c,q depending only on 6, c, q and Ki, 
inf p(||r(5)-I]o||2<Cx,Ac,,So(p)(^)^) >l-0((logp)-V+2)^ ^^g) 
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(ii). Under (C2), ( fl^) holds with probability greater than 1 — 0((logp) ^^^p ^"""^ + n ^/^). 

Although is larger than the uniformity class Uq, the rates of convergence of S (6) 
over the two classes are of the same order SQ{p){\ogp/n)^^~''^^'^ . 

Theorem [T] states the rate of convergence in terms of probability. The same rate of 
convergence holds in expectation with some additional mild assumptions. By (ITB]) and some 
long but elementary calculations (see also the proof of Lemma Hj), we have the following 
result on the mean squared spectral norm. 

Proposition 1 Under (CI) andp > for some > we have for 5 > 7+.^""'^, < g < 1, 
and some constant C > 0, 

sup E\\±\5) - Soll^ < Csl{p)(^^\'\ (17) 

Remark 2 Cai and Zhou (2010) established the minimax rates of convergence under the 
spectral norm for sparse covariance matrix estimation over Uq. It was shown that the 
optimal rate over Uq is so(p)(logp/n)(^~'^^/^. Since Uq{so{p)) C U*(K^~'^so{p)), this implies 
immediately that the convergence rate attained by the adaptive thresholding estimator 
over U* in Theorem [T] and f[T7|l is optimal. 

Remark 3 The estimator S (6) yields immediately an estimate of the correlation matrix 
Ro = {f^ij)i<i,j,<p which is the object of direct interest in some statistical applications. 
Denote the corresponding estimator of by R (6) = {f*j)i<ij,<p with f*- = a*-/ ^Jaii^jj. 
A parameter space for the correlation matrices is the following Iq ball: 

V 

ll\ := 7^;(so(p)) = \^R:RyO, max^ Ir^.f < So(p)}. (18) 

i=i 

Then Theorem [T] holds for estimating the correlation matrix Rq by replacing S (5), Sq 
and U* with R (6), Rq and TZ*, respectively. 
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Note that the covariance matrix Hq can be written as Hq = D^'^RqD^/'^, where D = 

diag(So)- The covariance matrix can thus be viewed as a weighted version of the correlation 

matrix with weights {icruajjY^'^}. Correspondingly, the parameter space U* in ([7]) can be 

viewed as the weighted version of TZ* given in (|T8|) with the same weights, 

p 

Ul := {S : S ^ 0, max ^((T,,(T,,)i/>i,f < So(p)}. 
3.2 Support recovery 

A closely related problem to estimating a sparse covariance matrix under spectral norm is 
the recovery of the support of the covariance matrix. This problem has been considered, 
for example, in Rothman, Levina and Zhu (2009). For support recovery, it is natural to 
consider the parameter space 

p 

Uo :=Wo(so(p)) = {S : maxj^/jay ^ 0} < So(p)}, 

i=i 

which assumes that the covariance matrix has at most Sq{p) nonzero entries on each row. 

Define the support of Sq = (cr° ) by \E' = j) : 7^ 0}. The following theorem 
shows that the adaptive thresholding estimator S (6) recovers the support \1' exactly with 
high probability when the magnitudes of nonzero entries are above certain threshold. 

Theorem 2 Suppose that So G Wq- Let 6 > 2 and 

|a°| > (2 + (5 + 7)W ^'^'^"^^ for all E"^ and some > 0. (19) 
\ n 

If either (CI) or (C2) holds, then we have 

inf pfsupp(S*(5)) = supp(So)') ^ 1. 

Similar support recovery result was established for the generalized universal thresholding 
estimator in Rothman, Levina and Zhu (2009) under the condition maxj a^^ < K and a lower 
bound condition similar to ( IT9l) . Note that in Theorem |2l we do not require maxj an < K. 
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Following Rothman, Levina and Zhu (2009), the ability to recover the support can 
be evaluated via the true positive rate (TPR) in combination with the false positive rate 
(FPR), defined respectively as 

#{(^,j):ar,7^0anda,,-^0} : a*. 7^ and a,, = 0} 

TPR = . r, \ ; — ; and FFR = ,, \ r . 

#{(z,j):^.,7^0} ^{(t,j):a,j = 0} 

It follows from Theorem |2] directly that P{FPR = 0) -> 1 and P{TPR = 1) 1 under the 
conditions of the theorem. 

The next result shows that 5 = 2 is the optimal choice for support recovery in the sense 
that a thresholding estimator with any smaller choice of 6 would fail to recover the support 
of So exactly with probability going to one. We assume X satisfies the following condition 
which is weaker than the Gaussian assumption. 

(C3) Suppose that 

E[(X, - fi,f{X, - /i,)(Xfc - /ifc)] = 0, E[(X, - ix,){X, - iij){Xk - iik)iXi - fii)] = 
^iii2 = for all ji ^ 22 e {i, j, /}. 



Theorem 3 Let Xij = ry'^^^ with < r < 2. Suppose that (CI) or (C2) holds. Under 
(C3) and p = exp(o(n^/^)), if So{p) = 0{p^~'^^) with some < ri < 1 and p — >■ 00, then 

inf Pfsupp(s'(r)) ^ supp(So)) ^ 1. 

Remark 4 The condition p = exp(o(n^/^)) is used in the proof to make sure the covari- 
ances of the samples {X„} can be well approximated by normal vectors. It can be replaced 
hj p = exp(o(n^/^)) if X is a multivariate normal population. 

3.3 Comparison with universal thresholding 

It is interesting to compare the asymptotic results for adaptive thresholding estimator S (6) 
with the known results for universal thresholding estimators. We begin by comparing the 
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rate of convergence of S (5) with that of the universal thresholding estimator introduced 
in Bickel and Levina (2008) in the case of polynomial-type tails. Suppose that (C2) holds. 
Bickel and Levina (2008) showed that 

Op(so{p) r ^1/2 ) ) (20) 



n 



for So G Uq. Clearly, the convergence rate given in Theorem[T]for the adaptive thresholding 
estimator is significantly faster than that in f l2U]) . 

We next compare the rates over the class U*, < q < 1. For brevity, we shall focus 
on the Gaussian case X ~ N{fj,, Sq). The following theorem gives the lower bound of the 
universal thresholding estimator. 

Theorem 4 Assume thatn^'^ <p< exp(o(n^/'^)) andS < so{p) < m.m{p^^^ , A{n/ logpY^"^} . 
We have, as p oo, 

M sup Pht, - E„||. > |-4-.(p)f!2IP)'-"'^) ^ 1 (21) 



and hence for large n, 

inf sup E||S, - Soll^ > ^A-'\P)C^T' ■ (22) 

The rate in (12T|) is slower than the optimal rate SQ{p){\ogp/n)^^~''^^'^ given in (flGl) when 
■5o(p) — oo as p — !■ oo. Therefore no universal thresholding estimators can be minimax-rate 
optimal under the spectral norm over if so(p) — )■ oo. 

If we assume the mean of X is zero and ignore the term X in I]„, then the universal 
thresholding estimators given in Bickel and Levina (2008) and Rothman, Levina and Zhu 
(2009) use the sample mean of the samples {XkiXkj'., 1 <k < n} to identify zero entries in 
the covariance matrix. The support of these estimators depends on the quantities > 
A„}. In the high dimensional setting, the sample mean is usually unstable for non-Gaussian 
distributions with heavier tails. Non-Gaussian data can often arise from many practical 
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applications such as in finance and genomics. For our estimator, instead of the sample 

^1/2 

mean, we use the Student t statistic (^ij/Oij to distinguish zero and nonzero entries. Our 
support recovery depends on the quantities Ol-"^ > 2 A/log p/n} which are more 

stable than /{|(3"y| > A„}, since t statistic is much more stable than the sample mean; see 
Shao (1999) for the theoretical justification. 

4 Data-driven choice of S 

Section [3] analyzes the properties of the adaptive thresholding estimator with a fixed value 
of 6. Alternatively, 6 can be selected empirically through cross validation (CV). In Bickel 
and Levina (2008) the value of the universal thresholding level A„ is not fully specified 
and the CV method was used to select empirically. They obtained the convergence 
rate under the Frobenius norm for an estimator that is based only on partial samples. 
Theoretical analysis on the rate of convergence under the spectral norm is still lacking. In 
this section, we first briefly describe the CV method for choosing 6 and then derive the 
theoretical properties of the resulting estimator under the spectral norm. 

Divide the sample {X^; 1 < A; < n} into two subsamples at random. Let rii and 
n2 = n — rii he the two sample sizes for the random split satisfying ni x n2 n, and let 
S]^, S2 be the two sample covariance matrices from the f th split, for v = 1, . . . , H , where 
if is a fixed integer. Let (6) and (^) be defined as in (ITT!) from the vth split and 

v=l 

Let aj = j/N, < j < 4N he 4N + 1 points in [0, 4] and take 

6 = j/N, where j = a.Tgmm R{j /N), 

0<j<4:N 

where > is a fixed integer. If there are several j attain the minimum value, j is chosen 
to be the smallest one. The final estimator of the covariance matrix Sq is given by S (6). 
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Theorem 5 Suppose X ~ N(fi, Eq) with Sq G Uq and miiij (T° > tq /or some tq > 0. Lei 
so{p) = 0{{\ogpy) for some 7 < 1 and < p < exp(o(?7,^/^)) for some > 0. We have 

inf Pf||r(5) -S0II2 < Cso{p)C-^y) ^ 1. 

Remark 5 The assumption that N is fixed is not a stringent condition since we only 
consider 5 belonging to the fixed interval [0, 4]. Moreover, we will only focus on the matrices 
in Uq due to the complexity of the proof. Extending to the case N 00 with certain rate 
and more general So is possible. However, it requires far more complicated proof and will 
not be discussed in the present paper. 

Remark 6 The condition Sq{p) = 0((logp)''') used in the theorem is purely for technical 
reasons and we believe that it is not essentially needed and can be weakened. This condition 
is not stringent when p = exp(n") and it becomes restrictive if p = ©(n""). 

Similar to the fixed 6 case, we also consider support recovery with the estimator S (5). 

Proposition 2 Suppose the conditions in Theorem{^ hold. For 12 (6), we have 

FPR = Op{s^{p)/p) ^ 0. 

Moreover, since 6 < 4, we have TPR = 1 with probability tending to one if the lower bound 
in ( fTPj) holds with 2 + 6 being replaced by 6. 

5 Numerical Results 

The adaptive thresholding procedure presented in Section |2] is easy to implement. In this 
section, the numerical performance of the proposed adaptive thresholding estimator S (6) 
is studied using Monte Carlo simulations. Both methods for choosing the regularization 
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parameter 5 are considered and their performance are compared with that of universal 
thresholding estimators. The adaptive thresholding estimator is illustrated in an analysis 
on a dataset from a small round blue-cell tumors microarray experiment. 

5.1 Simulation 

The following two types of sparse covariance matrices are considered in the simulations to 
investigate the numerical properties of the adaptive thresholding estimator S (6) . 

• Model 1 (banded matrix with ordering) . Sq = diag(Ai, A2), where Ai = {<^ij)i<i,j<p/2, 
aij = ^1 — j , A2 = 4/p/2xp/2- So is a two-block diagonal matrix. Ai is a banded 
and sparse covariance matrix. A2 is a diagonal matrix with 4 along the diagonal. 

• Model 2 (sparse matrix without ordering). Sq = diag(Ai, A2), where A2 = ^Ip/2xp/2i 
Ai = B+eIp/2xp/2, B = {bij)p/2xp/2 with independent bij = unif(0.3, 0.8) xBer(l, 0.2). 
Here unif(0.3, 0.8) is a random variable taking value uniformly in [0.3, 0.8]; Ber(l, 0.2) 
is a Bernoulli random variable which takes value 1 with probability 0.2 and with 
probability 0.8; and e = max(— Ajnm(-B), 0) + 0.01 to ensure that Ai is positive defi- 
nite. 

Under each model, n = 100 independent and identically distributed p-variate random 
vectors are generated from the normal distribution with mean and covariance matrix Sq, 
for p = 30, 100, 200. In each setting, 100 replications are used. We compare the numerical 
performance between the adaptive thresholding estimators S (6) and = S (2) and 
with the universal thresholding estimator of Rothman, Levina and Zhu (2009). Here 
6 is selected by five fold cross-validation in Section HI S2 is the adaptive thresholding 
estimator with fixed 6 = 2. The thresholding level A„ in is selected by five fold cross- 
validation method used in Bickel and Levina (2008). For each procedure, we consider two 
types of thresholding functions, the hard thresholding and the adaptive lasso thresholding 

17 



Sx{z) = x(l — \X/x\^) with rj = 4. The losses are measured by three matrix norms: the 
spectral norm, the matrix ii norm and the Frobenius norm. We report in Tables [1] and [2] 
the means and standard errors of these losses. We also carried out simulations with the 
SCAD thresholding function for both universal thresholding and adaptive thresholding. 
The phenomenon is very similar. The SCAD adaptive thresholding also outperforms the 
SCAD universal thresholding. For reasons of space, the results are not reported here. 

Table 1: Comparison of average matrix losses for Model 1 over 100 replications. The 
standard errors are given in the parentheses. 







Adaptive lasso 




Hard 






p 






±1 
















Operator 


norm 








30 


3.53(0.13) 


1.72(0.05) 


2.39(0.07) 


3.50(0.14) 


1.77(0.05) 


1.77(0 


04) 


100 


7.94(0.11) 


2.72(0.05) 


4.68(0.06) 


8.64(0.07) 


2.57(0.05) 


3.04(0 


05) 


200 


8.95(0.004) 


3.23(0.05) 


5.70(0.05) 


8.95(0.004) 


3.02(0.05) 


3.77(0 


05) 








Matrix £i 


norm 








30 


5.29(0.15) 


2.57(0.08) 


3.34(0.09) 


5.71(0.15) 


2.60(0.09) 


2.70(0 


06) 


100 


9.03(0.05) 


4.15(0.07) 


6.39(0.09) 


9.24(0.03) 


4.17(0.07) 


4.87(0 


09) 


200 


9.35(0.01) 


4.90(0.07) 


7.64(0.07) 


9.35(0.01) 


4.89(0.07) 


5.97(0 


09) 








Frobenius 


norm 








30 


5.97(0.10) 


3.15(0.05) 


3.68(0.05) 


6.58(0.09) 


3.29(0.05) 


3.29(0 


04) 


100 


15.93(0.12) 


6.57(0.05) 


8.92(0.06) 


16.88(0.03) 


6.79(0.06) 


7.53(0 


05) 


200 


24.23(0.01) 


9.62(0.05) 


14.20(0.07) 


24.24(0.01) 


9.97(0.06) 


11.68(0 


05) 
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Table 2: Comparison of average matrix losses for Model 2 over 100 replications. The 
standard errors are given in the parentheses. 







Adaptive lasso 




Hard 






p 






z 


^9 














Operator 


norm 








30 


1.48(0.02) 


1.24(0.03) 


1.19(0.03) 


1.50(0.02) 


1.25(0.03) 


1.21(0 


03) 


100 


5.31(0.01) 


2.82(0.05) 


4.71(0.03) 


5.31(0.01) 


2.69(0.05) 


3.97(0 


04) 


200 


10.74(0.01) 


6.78(0.08) 


10.52(0.02) 


10.74(0.01) 


6.58(0.10) 


10.04(0 


03) 








Matrix d 


norm 








30 


1.70(0.03) 


1.33(0.04) 


1.22(0.03) 


1.70(0.02) 


1.32(0.04) 


1.24(0 


03) 


100 


6.16(0.01) 


4.10(0.05) 


5.52(0.03) 


6.16(0.01) 


4.20(0.06) 


5.22(0 


03) 


200 


12.70(0.01) 


9.81(0.08) 


12.31(0.04) 


12.70(0.01) 


10.06(0.08) 


12.06(0 


04) 








Frobenius 


norm 








30 


4.08(0.03) 


2.52(0.04) 


2.57(0.04) 


4.10(0.03) 


2.50(0.04) 


2.45(0 


04) 


100 


12.77(0.01) 


7.57(0.05) 


10.96(0.04) 


12.78(0.02) 


8.07(0.06) 


10.00(0 


05) 


200 


25.51(0.01) 


16.94(0.07) 


24.67(0.03) 


25.52(0.01) 


18.69(0.07) 


24.05(0 


03) 
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Under Model 1 and Model 2, both adaptive thresholding estimators S {6) and 
uniformly outperform the universal thresholding rule Sg significantly, regardless which 
thresholding function or which loss function is used. Between S (6) and S2, S (6) performs 
better than general. Between the two thresholding functions, the hard thresholding 

rule outperforms the adaptive lasso thresholding rule for S2, while the difference is not 
significant for S (6). For both models, the behaviors of hard and adaptive lasso universal 
thresholding rules are very similar. They both tend to "over-threshold" and remove many 
nonzero off-diagonal entries of the covariance matrices. 

For support recovery, again both S (6) and outperform S^. The values of TPR 
and FPR based on the off-diagonal entries are reported in Tables [3] and HI For Model 1, 
Tig tends to estimate many nonzero off-diagonal entries by zero when p is large. To better 
illustrate the recovery performance elementwise for the two models, the heat maps of the 
nonzeros identified out of 100 replications when p = QO are pictured in Figures [1] and [2l The 
heat maps suggest that the sparsity patterns recovered by S (5) and have significantly 
better resemblance to the true model than S^. 

Table 3: Comparison of support recovery for Model 1 over 100 replications. 



Adaptive lasso 




Hard 




P ^9 




±1 


^9 


±\6) 


±1 


30 TPR 0.57 


0.84 


0.72 


0.46 


0.79 


0.72 


FPR 0.07 


0.01 


0.00 


0.05 


0.003 


0.00 


100 TPR 0.15 


0.76 


0.57 


0.008 


0.69 


0.57 


FPR 0.01 


0.01 


0.00 


0.00 


0.00 


0.00 


200 TPR 0.00 


0.73 


0.51 


0.00 


0.65 


0.51 


FPR 0.00 


0.003 


0.00 


0.00 


0.00 


0.00 



20 



Table 4: Comparison of support recovery for Model 2 over 100 replications. 



Adaptive lasso 




Hard 




P ^9 


±\6) 


±1 




±\6) 


±1 


30 TPR 0.02 


0.95 


0.88 


0.00 


0.91 


0.88 


FPR 0.00 


0.01 


0.00 


0.00 


0.00 


0.00 


100 TPR 0.00 


0.80 


0.33 


0.00 


0.66 


0.33 


FPR 0.00 


0.01 


0.00 


0.00 


0.00 


0.00 


200 TPR 0.00 


0.68 


0.09 


0.00 


0.49 


0.09 


FPR 0.00 


0.01 


0.00 


0.00 


0.00 


0.00 



Model 1 




(d) tl (e) Eg (Hard) (f) Sg(Adap.lasso) 



Figure 1 : Heat maps of the frequency of the zeros identified for each entry of the covariance 
matrix (when p = 60) out of 100 replications. White color is 100 zeros identified out of 100 
runs, and black is 0/100. 
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Model 2 




(d) ±1 (e) Eg (Hard) (f) Sg(Adap.lasso) 



Figure 2: Heat maps of the frequency of the zeros identified for each entry of the covariance 
matrix (when p = 60) out of 100 rephcations. White color is 100 zeros identified out of 100 
runs, and black is 0/100. 
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5.2 Correlation analysis on real data 

We now apply the adaptive thresholding estimator S (6) to a dataset from a small round 
blue-cell tumors (SRBC) microarray experiment (Khan et al., 2001) and compare the ability 
of support recovery with that of the universal thresholding estimator S^. The estimator 
S2 is not considered here since the simulation results in Section 15.11 show that S (6) 
outperforms when the sample size is not large. The SRBC data set has been analyzed in 
Rothman, Levina and Zhu (2009) in which the universal thresholding rules were considered. 
To make the results comparable, we shall follow the same steps as those in Rothman, Levina 
and Zhu (2009). 

The SRBC data has 63 training tissue samples, and 2308 gene expression values recorded 
for each sample. The original data has 6567 genes and was reduced to 2308 genes after 
an initial filtering; see Khan et al. (2001). The 63 tissue samples contain four types of 
tumors (23 EWS, 8 BL-NHL, 12 NB, and 20 RMS). As in Rothman, Levina and Zhu 
(2009), the genes were first ranked by the amount of discriminative information based on 
the F-statistic, 

where n = 63 is the sample size, k = 4 is the number of classes, n^, 1 < m < 4, are 
the sample sizes of the four types of tumors, Xm and a^, are the sample mean and sample 
variance of the class m, and x is the overall sample mean. According to the F values, the 
top 40 and bottom 160 genes were chosen. The first 40 genes were also ordered according 
to the ordering given in Rothman, Levina and Zhu (2009). Based on the 200 genes, the 
performance of the two estimators S (6) and Sg was considered. The tuning parameters 
6 and A„ were selected by five fold cross validation. To this end, we need to divide the 
63 samples into five groups of nearly equal sizes. As there are four types of tumors in the 
samples, we let the proportions of the four types of tumors in each group be nearly equal 
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so that each fold is a good representative of the whole. Three fold cross validation is also 
used in this way and the results are similar. 




(c) Eg Hard (97.88% zeros) (d) tg AL (97.88% zeros) 

Figure 3: Heatmaps of the estimated supports. 



Figure |3] plots the heat maps of S (5) with hard thresholding (S (5) Hard), with 
hard thresholding (S^ Hard), S (6) with adaptive lasso thresholding (S (5) AL), with 
adaptive lasso thresholding (Sg AL). AL and S^, Hard result in very sparse estimators, 
with 97.88% zero elements in off diagonal positions. The estimator S (5) AL is the least 
sparse one with 69.78% zeros, while S (6) Hard has 83.11% zeros. The "over-threshold" 
phenomenon in the real data analysis is consistent with that observed in the simulations. 
The universal thresholding rule removes many nonzero off diagonal entries and results in an 
"over-sparse" estimate, while adaptive thresholding with different individual levels results 
in a clean but more informative estimate of the sparsity structure. 
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6 Discussion 



This paper introduces an adaptive entry-dependent thresholding procedure for estimating 
sparse covariance matrices. The proposed estimator E (5) = (a* ) enjoys excellent perfor- 
mance both theoretically and numerically. In particular, S (6) attains the optimal rate of 
convergence over U* given in ([7]) while universal thresholding estimators are shown to be 
sub-optimal. The main reason that universal thresholding does not perform well is that 
the sample covariances can have a wide range of variabilities. A simple and natural way 
to deal with the heteroscedasticity is to first estimate the correlation matrix Rq and then 
renormalize by the sample variances to obtain an estimate of the covariance matrix. We 
shall discuss below two approaches based on this idea. 

Denote the sample correlation matrix by i? = {rij)i<ij<p with fij = aij/ ^/oaajj. An 
estimate of the correlation matrix i?o can be obtained by thresholding fjj. Define the 
universal thresholding estimator of the correlation matrix by i?(A ^) — i,^ij^)pxp with 

and the corresponding estimator of the covariance matrix by ±r = D]I^R{\n)D]l^ , where 
Dn = diag(S„). It is easy to see that a good choice of the threshold is A„, = C {log p)/n 
for some constant C > 0. It is however difficult to choose C because the choice depends 
on the unknown underlying distribution. Assuming the constant C is chosen sufficiently 
large, it can be shown that the resulting estimator S/; attains the same minimax rate of 
convergence. However, the estimator S^j is less efficient than S (6) for support recovery. 
In fact, Sfl is unable to recover the support of Sq exactly for a class of non-Gaussian 
distributions of X. Denote by V(7, 6, Ki) the class of distributions F of X satisfying the 
conditions of Theorem O Then it can be shown that for any 7 > 0, 5 > 2 and some 
K, = i^i(7) > 0, 

inf sup pfsupp(Sij) ^ supp(i:o)) ^ 1. (23) 
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The sample correlation coefficients fj^ are not homoscedastic, although the range of 
variabilities is smaller in comparison to that of sample covariances. This is in fact the 
main reason for the negative result on support recovery given in Equation (!23l) . A natural 
approach to deal with the heteroscedasticity of the sample correlation coefficients is to first 
stabilize the variance by using Fisher's ^-transformation, then threshold and finally obtain 
the estimator by inverse transform. Applying Fisher's 2;-transformation to each correlation 
coefficient yields 

= - In — iLji. 

' 2 1 - hj 

When X is multivariate normal, it is well-known that is asymptotically normal with 
mean (l/2)ln((l +rjj)/(l — Tij)) and variance l/(n — 3). The behavior of Zij in the non- 
Gaussian case is more complicated. In general, the asymptotic variance of Zij depends on 
EXfX'j even when = 0; see Hawkins (1989). Similar to the method of thresholding the 
sample correlation coefficients discussed earlier, universally thresholding (Zij)pxp is unable 
to recover the support of Sq exactly for a class of non-Gaussian distributions of X satisfying 
the conditions in Theorem [2l 

In conclusion, the two natural approaches based on the sample correlation matrix dis- 
cussed above are not as efficient as the entry-dependent thresholding method we proposed 
in Section [2l For reasons of space, we omit the proofs of the results stated in this section. 
We shall explore these issues in detail elsewhere. 



7 Proofs 

We begin by collecting a few technical lemmas which are essential for the proofs of the main 
results. The ffist lemma is an exponential inequality on the partial sums of independent 
random variables. 
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Lemma 1 Let ^i, ■ ■ ■ be independent random variables with mean zero. Suppose that 
there exists some t > and i?„ such that Ylk=i ^^fc^*'^*"' — -^n- Then for < x < Bn, 

n 

> CtBnX^ < exp(-x2), (24) 

k=l 

where Ct = t + t~^. 

Proof of Lemma [1]. By the inequality |e* — 1 — s| < s2gsmax(s,o)^ have for any t > 0, 

n n 

P(5^6 >Ci^5n^) < exp{-tC.,B.^x)Y[Eexp{t^k) 

k=l k=l 

n 

< exp(-tC,5„x) J](l + t2E4Vl^'^l) 



k=l 

n 



< exp ( - tCr^Br^x + t'Ee^e*'«'=') . 

fc=i 

Take t = ri{x / Bn). It follows that 

n 

P(^^6 > Cr,BnX^ < exp - r]CrjX^ + ?7Vj = exp(-x^), 

k=l 

which completes the proof. | 

The second and third lemmas are on the asymptotic behaviors of the largest entry of 
the sample covariance matrix and 6ij. The proof of Lemma[2]is given in Cai and Liu (2010). 

Lemma 2 (i). Under (CI), we have for any S > 2, e > and M > 0, 

p(max|<T,, - a^^\/eiP > d^^^gpF) = O ((logp)-i/V'+') , (25) 

P(max{|% - ^,,1} > ealal) = 0{p~^), (26) 



and 



P(^max \X'\ > C ^ log p/n^ = 0{p-^') (27) 



27 



for some C > 0. 

(ii). Under (C2), still hold if we replace O (^(logp)"^/^^''^^) and 0{p~^'^) 

with Oi (logp)~^/^p~^~'"^ + n~^/^ ) and 0{n~^/^) respectively. 



Lemma 3 Let X = (Xi, ■ ■ ■ ,Xp) be a mean zero random vector. Suppose that Cov{X) 
Ipxp, (C3) holds and p oo. Then under (CI) or (C2), we have for any 5 > 0, 

i<j<p 



" 2 

P( max V X^iX^j > (4 - 5) logp) ^ 1. 



k=l 



Proof of Lemma [31 We arrange the two dimensional indices {{i,j) : 1 < i < j < p} in 
any ordering and set them as {{im,jm) : 1 < rra < p{p — l)/2 =: L}. Let 



Yhm. — 6,^ ^ Xhi_X, 



S^ = n-'/^J2Yk^, = {\S^\ > y/ {A -5) logp}, l<m<L. 



k=l 



Define Ykm = Yk,nI{\Ykm\ < Sn^/n/(logp)'•^} and = Yfcm, - EYfc„, where (5„ 
sufficiently slow. Then by (CI) or (C2) we have when n is large, 

2 

nOi^) I > XkiXkA 

i<j<p 

fc=i 

-M 



n 2 

P( max VXfc,Xfc,- > (4 -5) logp) 

V l<i<j<p I / 



> P( ^max^n-^l ^ffc^l > (4 -25) logp) - 0{p-^ + n 
fc=i 

2 



> P( max n 7 

l<m<L I 
~ ~ k=l 



km 



> 41ogp — loglogp + X ) — 0(p +n 



-M ,-e/8^ 



(28) 



for any M > and x < 0. Set ?/„ = v^41ogp — log logp + x and 



n 

Am = > 



fc=i 

Then by Bonferroni's inequality, we have for any fixed /, 

fe=l d=l l<ii<-<id<L j=l 

Write 



(29) 
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By Theorem 1 in Zaitsev (1987), we have 



P(|N|,,oo >2/n-^y'(logp)"V2j +ciexp(-C25-^/Mogp) 

n 

fc=i 



(30) 



where ci and C2 are positive constant depending only on d, \-\d,oo means |a|d,oo = niini<j<d |aj| 
for a = (oi, ■ ■ ■ , ad)i and N is a (i dimensional normal random vector with mean zero and 
covariance matrix Cov(Yfc). Set 



B 



|NU,oo >Z/nT5y'(l0gp)-^ 



/2 



We can check that ||Cov(Nfc) — Idxd\\2 = 0{l/(logp)^). Let Z be a standard d- dimensional 
normal vector. Then we have 



< P |ZL^> 



d,oo _ Vn 



/2 



fP(||Cov(Nfc) -/,,,||2|Z|2 > <5y2(iogp)-i/2' 
;i + 0(1)) (4=^"' exp(-x/2))' + 0(exp(-C(logp)2)). 



(31) 



> (1 - 0(1)) ( ^p-2exp(-x/2))' - 0(exp(-C(logp)2)). 



Similarly we can get 

Submitting ([22])- ([32]) into we can get 

hm P( max n-^|X^n„'>y^) > X](-l)'^^i exp(-x/2)) '/rf! 



(32) 
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k=l 



— )• 1 — cxp ( ^=exp(— x/2) 

as / — oo. Letting x — t- — oo, we prove the lemma by (128!) and (133|) . | 



(33) 
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Proof of Theorem [H By (CI) or (C2), we have % < CK^cTuC^jj. On the event 

{maxjj |(3"ij — | < Xij} fl < 29ij for aU i,j}, we have by the conditions (i)-(iii) 
on Sx{z) that 
p 



^0 I 



p p 

< 2^^ A,,/{|4 1 > A,,} + 5^ - 4 > |a° | < A,,} 

+ El4|/{I4I<2A..} 

< 2 Aj-'^Kr + (1 + c) < A.,} + wmwu < 

j=l j=l j=l 

dogp\(i-'?)/2 



< C'Ki,5,c,gSo(p) 



n 



The proof follows from Lemma [2] and the fact ||A||2 < ||^||li for any symmetric matrix. 



Proof of Theorems 2 and [3l Theorem 2 follows from Lemma [2] immediately. We now 
prove Theorem [3l For each 1 < i < p, let Ai be the largest subset of {1, ■ ■ ■ ,p} such 
that Xi is uncorrelated with {Xk,k G Ai}. Let ii =argmin{|j — z| : j G Ai}. Then we 
have |zi — z| < s. Also, Card(y4i) > p — s. Similarly, let Ai be the largest subset of Ai^i 
such that Xjj j is uncorrelated with {X^, k G Ai} and =argmin{|j — : j G Ai}. We 
can see that \ii — i\ < Is and Card(A/) >Card(A/_i) — s > p — si. Take / = [p'^'^] with 
< T2 < min(r^/3, Ti). Then Xi^, . . . ,Xi^ are pairwise uncorrelated random variables, 
where we set io = i. Clearly ii,--- ,ii G Bi = {j : (T°- = 0;j ^ i). Without loss of 
generality, we assume that Xi, ■ ■ ■ , X; are pairwise uncorrelated. Note that \s\{z) \ > \z\—X. 

30 



It suffices to show that for some Eq > 0, 

P(^maxJA-^^.|(T,,|} > 1 + £o) ^ 1. (34) 

Clearly, we can assume EX = and Var(Xi) = 1 for 1 < i < /. By Lemma [2] and (iMl) . we 
have miujj A„jj > with probability tending to one. By Lemma [2] it suffices to show that 
for any < r < 2, 

n 

An := p(meiX^l{ne,,y'/^\ J^X^X^, } > r^/I^) ^ 1. (35) 

k=i 

Since logp < (4—5) log / for < 6 < 4— r^/ T2 and large n, fl5Sl) follows from Lemma 3. | 

Lemmas m and [5] below, proved in Cai and Liu (2010), are needed to prove Theorems H] 
andO 

Lemma 4 Suppose that X ~ A^(/x, Sq) with So £ ^^q. i^ei so(p) = 0((logp)'^) for some 
7 < 1 and < p < exp(o(n^/^)) for some ^ > 0. Let 6 > \/2. Then there are at most 
O(so(p)) nonzero elements in each row ofS (6). Furthermore, 

Togp\ 

ri yo) - z^o\\2^ o^,5,m max 0-^^60 I 
for any M > 0, where C^^s,m is a constant depending only on 7, S, M, and 



inf_ P(\\±\6) - S0II2 < C,,s,M^^^iMp)C-^Y') ^ 1 - 0(P~^'^ (36) 



sup E||S^(5)-So||^<Csg(p)i^ (37) 



/or some constant C > 0. 



Lemma 5 Lei Ajj = ry tL'^i/i < r < \/2. Under the conditions of Lemma 4, 

P(min ^ > A„,,(r)} > ^ 1 (38) 

with any eo < (1 — r^/2)/2, where Bi = {j : = 0;j 7^ i}. Hence for some constant 
C>0, 

inf P(\\t\T) - S0II2 > Cmina°p^o/2^o(p)f^)'^') ^ 1- 
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Proof of Theorem [4l To simplify the notation, we shall write sq for sq{p). We construct 
a matrix Sq G U^. Let si = [{sq - iy~''{\ogp/ny''/^] + 1 and (Xi, ■ ■ ■ , X,J, X,^+i, ■■■ ,Xp 
be independent. Let cr° = sq for all i > si, a^^ = 1 for 1 < i < si and erf- = A^^so^/log p/ 



n 



for 1 < i 7^ j < si. Note that = for i 7^ j > si. Since sq < 4:^/n/\ogp, Sq is a positive 
definite covariance matrix belonging to U*. Set iVf„ = (o"fj)i<j,j<si- We first suppose that 



A„ < 3 ^app^/2\ogp/n. Lemma |5] yields 

j=Sl + l 

with any eo < 3/8. Take eo = 7/20 and note that p^^^ > sq, > n^/^. By the inequality 
\s\{z)\ >z-X, 

inf .upP(||E,-E„||.>^.?(p)(!^)'-"'Vl- (39) 

We next consider the case A„, > 3"^crppA/2 logp/n. We have 

IIS3-S0II2 > ||M„-M„||2, 
where iVf„ = ('5'fj)i<ij<si. As in Lemma 2, we can get for any 7 > 

Pf max - a° I > y^2^ih^^) < Csj{\ogp)-^/^p-\ 
Taking 7 = 1, we have with probability tending to one, maxi<j<j<sj \aij\ < (4~^so + 



\/2)^y\og p/n, which implies that afj = for 1 < i 7^ j < si. Thus, with probability 
tending to one. 



II A> II ^^^-1 -lA ^Ogp 3 2~gf^0gp\i^-l)/^ 

This and §9i) together imply | 



Proof of Theorem [5] and Proposition [2l For brevity, we only consider the case H = 1. 
The proof for general H is similar. We first show that for any e > 0, 

p(^6>V2-e^^l. (40) 
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Since the random split is independent with the sample {Xi, ■ ■ ■ , X„}, we can assume that 
the two samples are {Xi, ■ ■ ■ , X„j} and {X„j+i, ■ ■ ■ , X^}. Let XI2 be the sample covariance 
matrix from {X„^+i, ■ ■ ■ , X„} and be defined as in ( ITT]) from {Xi, ■ ■ ■ , X.„^}. Define 

L = jo/N, where jo = argmin ||i;*(j/A^) - SqIIf- 

0<j<4N 

Set a„, = II (5) — SqIIf and r„ = p~'^\\'E^{6o) — '^oW'f- By the proof of Theorem 1, we 
have P^||S]^(2) — SqUli < CiSo(p)(logp/n)-'^/^ j — )■ 1 for some Ci > 0. Using the inequality 
p~"'^|| A|||n < |A|oo|| A||ij for any pxp symmetric matrix A and the definition of 60, we have 
P(^^n < C2So{p) log p/n^ — 1 for some C2 > 0. Note that 

E|(F,S2-So)|'<Cn-i 

for any p x 1 vector V with || = 1. By the proof of Theorem 3 in Bickel and Levina 
(2008) and the assumption that N is fixed, we can see that, 

an < Op{^)ai/' + 0,{^yj^ + r.. (41) 

Hence for some C3 > 0, 

P{an < C3So(p) logp/nj ^ 1. (42) 

Note that by applying Lemma 5 to the samples {Xi, ■ ■ ■ , X„^}, 

P(a„, < C3So{p)\ogp/n,5 < V2- = o{l). 

This together with ( l42l) shows that 

P(6 <V2-s^ < P(6 < V2 - e,an < C-sSoip)logp/n^ +o(l) = o(l), 

and hence ( liOj) holds. Since is fixed, we have \a — \/2\ > Eq for some fixed Eq > which 
depends on A^. This together with fHOj) implies 

p(^6>V2 + e^^l (43) 
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for some e > 0. By Lemma 4, we see that with probabihty tending to one, for each i, there 
are at most 0{sq{p)) nonzero numbers of {\sx^j{cTij)\; j G Bi} and by Lemma 2, they are of 
order O (maxj cr^i\/logp/ n) . Let = {j : cr°- 0} and = {j : a*- 0}. Then by the 
conditions on sx{z), we have 



with probabihty tending to one. The proof of Theorem |5]is completed. Finahy, Proposition 
[2] is proved by fH3|l . Lemmas 2 and 4. | 

8 Supplemental materials 

Additional proofs: A supplement to the main paper contains additional technical argu- 
ments including the proofs of Lemmas 2, 4 and 5. (pdf file) 
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